Escherichia coli is a gram negative bacterium and it is the most studied bacterial model organism for many reasons like its ability to grow fast on cheap media and the fact that it can be genetically manipulated and modified much easier with respect to eukaryotic organisms like yeast or animals or even other bacteria. When an organism is as studied as E. coli the amount of data available about it can be massive, for this reason we would like to perform a bioinformatic analysis to extract knowledge from on one of the many RNA-seq dataset available for E. coli.
This project aims to establish a proxy for determining the activity of transcription factors based on the expression levels of the genes regulated by those factors. In prokaryotes genes with related functions are encoded together within the same genome block, which is called operon, and they are usually transcribed together in a so called polycistronic transcript because they are under the control of the same promoter.
The null hypothesis posits that the abundance of transcripts serves as a reliable proxy for activity of the regulator. However, this assumption may not hold for all transcription factors. Some factors are active only when phosphorylated, meaning that their transcripts may be constitutively expressed while their activity depends on phosphorylation, correlating instead with the activity of the corresponding kinase.
We will also focus on developing a general model to relate transcription factor activity to transcript abundance. Significant deviations from this model could indicate transcription factors whose activity is not accurately reflected by transcript levels. One challenge of this approach is the fact that genes of the same operon may be highly co-expressed together which is a problem for many models like linear regression, so we will try to address it during our analysis.
Here are the steps that we performed in our analysis
Exploratory analysis
Models …
library(readr)
library(matrixStats)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)
library(caret)
library(glmnet)
library(igraph)
library(vip)
library(ggrepel)
set.seed(123)
Let’s start importing the dataset from PreciseDB which is an RNA-seq compendium for Escherichia coli from the Systems Biology Research group at UC San Diego. that contains the expression levels for 3,923 genes in E. coli for 278 different condition. The approach they used was culturing bacteria on many different media, then they measured gene expression values. Then they proceeded to knock-out many different genes for each condition and they measured expression again expecting variations of expression, for example they knocked out a gene for iron metabolism on a medium containing iron and this may for sure affect the expression levels of the E.coli culture.
# Import file
log_tpm <- read.csv("log_tpm_full.csv", row.names = 1)
log_tpm
Next we want to perform some pre-processing. First we want to exclude:
Condition with knock-out genes
Genes with isoforms
For the nature of our analysis we are not interested in the conditions in which a gene is knocked-out because the influence of the knock-out gene may be huge on its related pathways and we excluded them because we want unbiased results. Researchers of PreciseDB also collected expression data about many gene isoforms and for the same genes they did not collect data about the full transcript. During the preprocess this would have resulted in having unmappable bnumbers and duplicate genes so we removed them from the dataset before proceeding.
# Exclude condition with knockout genes
log_tpm <- log_tpm[, -grep("_del", colnames(log_tpm))]
# Drop genes with isoform
genes_with_isoforms <- grep("_\\d+$", rownames(log_tpm), value = TRUE)
log_tpm <- subset(log_tpm, !(rownames(log_tpm) %in% genes_with_isoforms))
dim(log_tpm)
[1] 4257 188
Now we can proceed to import the regulatory network from RegulonDB that reports the target genes for each regulator.
regulator <- read.table(file="tableDataReg.csv",
header=TRUE,
sep=",")
regulator
We decide to eliminate:
Entries that are not referred to a protein regulator.
Relationships reported as Weak or Unknown to work only on meaningful interactions.
regulator <- regulator[which(regulator[, 2] != "ppGpp"), ]
w <- which(trimws(regulator[,7])=="W")
if(length(w)>0){
regulator <- regulator[-w,]
}
w <- which(trimws(regulator[,7])=="?")
if(length(w)>0){
regulator <- regulator[-w,]
}
nrow(regulator)
[1] 4229
There is a discrepancy between the gene identifiers present in the regulatory network obtained from RegulonDB and those in the PreciseDB dataset. While the regulatory network uses gene names, our dataset contained identifiers in the form of bnumbers. So it’s better to convert bnumbers to gene names to facilitate comparison and downstream analysis. In addition we decided to:
Remove entries with bnumbers that don’t map with a gene name
Remove duplicate genes
# Loading files from ECOcyc
map_bnum <- read.delim("mapbnum.txt", header = TRUE)
map_bnum <- map_bnum[c("Gene.Name", "Accession.1")]
# Map between my dataset and the file of ecocyc
log_tpm$gene_number <- rownames(log_tpm)
log_tpm <- merge(log_tpm, map_bnum, by.x = "gene_number", by.y = "Accession.1", all.x = TRUE)
# Rearrange the dataset
log_tpm <- log_tpm[, c("Gene.Name", setdiff(names(log_tpm), "Gene.Name"))]
# Removing unmapped genes bnumber
log_tpm <- subset(log_tpm, !is.na(Gene.Name))
#removing duplicate genes (it also has all expression values equal 0 so very bad)
log_tpm <- subset(log_tpm, !(log_tpm$Gene.Name == "insI2"))
#setting rownames and dropping the first 2 columns
rownames(log_tpm) <- log_tpm$Gene.Name
log_tpm <- log_tpm[,3:ncol(log_tpm)]
#transpose log_tpm to have genes as columns and conditions as rows
log_tpm <- t(log_tpm)
We would like to verify if we can use one the four major summary statistics (mean, median, maximum, minimum of expression) of each gene to model the relationship between the targets and their regulator. Now let’s analyze the distribution of these statistics to understand which is the best value to choose, if any, then we also want check if they follow a Gaussian distribution using the Shapiro–Wilk test which computes the W statistics as follows:
\[ W = \dfrac{\big(\sum^n _ {i=n} a_i x_{(i)} \big ) ^2}{\sum^n _ {i=n} (x_i - \bar{x})^2} \]
Then this W value is used for the hypothesis testing that follows:
\[ H_o: \text{a sample } x_1, \cdots, x_n \text{ is drawn from a normally distributed population.} \\ H_a: \text{a sample } x_1, \cdots, x_n \text{ is not drawn from a normally distributed population.} \]
If the test returns a p-value lower then the threshold 0.05 it means that the data do not follow a normal distribution.
# Histograms of Summary Statistics
log_tpm_mean <- data.frame(value = apply(log_tpm, 2, mean))
mean_hist <- ggplot(log_tpm_mean, aes(x = value)) +
geom_histogram(binwidth = 0.5, fill = "skyblue",
color = "black", bins = 100) +
labs(x = "Mean log-TPM", y = "Frequency") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
shapiro.test(log_tpm_mean$value) #not gaussian
Shapiro-Wilk normality test
data: log_tpm_mean$value
W = 0.98717, p-value < 2.2e-16
log_tpm_median <- data.frame(value = apply(log_tpm, 2, median))
median_hist <- ggplot(log_tpm_median, aes(x = value)) +
geom_histogram(binwidth = 0.5, fill = "lightgreen",
color = "black", bins = 100) +
labs(x = "Median log-TPM", y = "Frequency") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
shapiro.test(log_tpm_median$value) #not gaussian
Shapiro-Wilk normality test
data: log_tpm_median$value
W = 0.98578, p-value < 2.2e-16
log_tpm_max <- data.frame(value = apply(log_tpm, 2, max))
max_hist <- ggplot(log_tpm_max, aes(x = value)) +
geom_histogram(binwidth = 0.5, fill = "lavender",
color = "black", bins = 100) +
labs(x = "Max log-TPM", y = "Frequency") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
shapiro.test(log_tpm_max$value) #not gaussian
Shapiro-Wilk normality test
data: log_tpm_max$value
W = 0.99611, p-value = 3.889e-09
log_tpm_min <- data.frame(value = apply(log_tpm, 2, min))
min_hist <- ggplot(log_tpm_min, aes(x = value)) +
geom_histogram(binwidth = 0.5, fill = "lightpink",
color = "black", bins = 100) +
labs(x = "Min log-TPM", y = "Frequency") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
shapiro.test(log_tpm_min$value) #not gaussian
Shapiro-Wilk normality test
data: log_tpm_min$value
W = 0.91615, p-value < 2.2e-16
grid.arrange(mean_hist, median_hist, max_hist, min_hist, nrow = 2, ncol = 2,
top = textGrob("Histograms of Summary Statistics",
gp=gpar(fontsize=16)))
Based on the results of the Shapiro-Wilk tests conducted on the four
distributions and their respective histograms, we can conclude that the
distributions do not follow to a Gaussian (normal) distribution.
Consequently, this implies that choosing measures such as the mean,
median, maximum, or minimum may not match the linear regression
assumption that the data follows a Gaussian (normal) distribution.
Now we plot some histograms to have a better idea on the number of positive and negative interactions of each regulator.
# Histogram of how many genes are regulated by each gene
positive_reg <- regulator[regulator$X6.function == "+",]
negative_reg <- regulator[regulator$X6.function == "-",]
unique_regulators <- unique(regulator$X3.RegulatorGeneName)
pos_counts <- c()
neg_counts <- c()
for(reg in unique_regulators){
pos_counts <- c(pos_counts,
count(positive_reg[positive_reg$X3.RegulatorGeneName == reg,]))
neg_counts <- c(neg_counts,
count(negative_reg[negative_reg$X3.RegulatorGeneName == reg,]))
}
pos_counts <- unlist(pos_counts)
names(pos_counts) <- unique_regulators
neg_counts <- unlist(neg_counts)
names(neg_counts) <- unique_regulators
pos_counts <- data.frame(value = pos_counts)
shapiro.test(pos_counts$value) #not gaussian
Shapiro-Wilk normality test
data: pos_counts$value
W = 0.29429, p-value < 2.2e-16
neg_counts <- data.frame(value = neg_counts)
shapiro.test(neg_counts$value) #not gaussian
Shapiro-Wilk normality test
data: neg_counts$value
W = 0.25872, p-value < 2.2e-16
total_counts <- data.frame(value = pos_counts$value + neg_counts$value)
shapiro.test(total_counts$value) #not gaussian
Shapiro-Wilk normality test
data: total_counts$value
W = 0.29589, p-value < 2.2e-16
pos_counts_hist <- ggplot(pos_counts, aes(x = value)) +
geom_histogram(binwidth = 10, fill = "skyblue",
color = "black", bins = 20) +
labs(x = "Positive Regulations Count", y = "Frequency") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
neg_counts_hist <- ggplot(neg_counts, aes(x = value)) +
geom_histogram(binwidth = 10, fill = "lightgreen",
color = "black", bins = 20) +
labs(x = "Negative Regulations Count", y = "Frequency") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
total_counts_hist <- ggplot(total_counts, aes(x = value)) +
geom_histogram(binwidth = 10, fill = "lightpink",
color = "black", bins = 20) +
labs(x = "Total Regulations Count", y = "Frequency") +
theme_minimal() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
grid.arrange(pos_counts_hist, neg_counts_hist, total_counts_hist, ncol = 1)
These histograms show how many genes are regulated by how many regulators; for example there are more than 150 genes that positively interact with 0 genes, this is not surprising because we removed weak and unknown interactions, we can also see that there is a gene that negatively controls a number of genes close to 400. There are slightly more negative regulators then positive. In addition we can also notice that the majority of regulators have less than 100 targets. There is one exception: crp has 300 target.
# Creating adj matrix to plot the network
edge_list <- cbind(RagulatorName = regulator$X3.RegulatorGeneName,
Target = regulator$X5.regulatedName)
graph <- graph_from_edgelist(edge_list)
layout <- layout_with_fr(graph, niter=100)
# Visualizzazione 3D della network
plot.igraph(graph, layout=layout, vertex.label=NA, vertex.size=3, edge.arrow.size=0.2, edge.curved=TRUE, main="E.coli Network", xlim=c(-0.5, 0.5), ylim=c(-1, 1))